Introduction

This project studies Netflix’s coverage of movies and TV shows in two parts: general trends and text analysis of show summaries.

Analysis

NLP Topics

Beyond examining the distribution of production across Netflix TV series and movies, this project aims to provide insight into the key characteristics associated with a “high rated” show (covering both TV series and movies). More specifically, the second part of the project uses natural language processing to build word clouds from show summaries, compare the most frequent words in all shows against those in high rated shows, and compare Flesch Kincaid scores with ratings, both overall and within the TV series and movie subcategories. The purpose of this project is to provide an overview of the dynamics on Netflix and to offer possible suggestions that could increase a show’s probability of having a high IMDb score.

Summaries Word Cloud

First, we define a standard for a high IMDb rating.

##     Title              Genre               Tags            Languages        
##  Length:15480       Length:15480       Length:15480       Length:15480      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  Series.or.Movie    Hidden.Gem.Score Country.Availability   Runtime         
##  Length:15480       Min.   :0.600    Length:15480         Length:15480      
##  Class :character   1st Qu.:3.800    Class :character     Class :character  
##  Mode  :character   Median :6.800    Mode  :character     Mode  :character  
##                     Mean   :5.938                                           
##                     3rd Qu.:7.900                                           
##                     Max.   :9.800                                           
##                     NA's   :2101                                            
##    Director            Writer             Actors          View.Rating       
##  Length:15480       Length:15480       Length:15480       Length:15480      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##    IMDb.Score    Rotten.Tomatoes.Score Metacritic.Score Awards.Received  
##  Min.   :1.000   Min.   :  0.00        Min.   :  5.00   Min.   :  1.000  
##  1st Qu.:5.800   1st Qu.: 38.00        1st Qu.: 44.00   1st Qu.:  1.000  
##  Median :6.600   Median : 64.00        Median : 57.00   Median :  3.000  
##  Mean   :6.496   Mean   : 59.52        Mean   : 56.81   Mean   :  8.764  
##  3rd Qu.:7.300   3rd Qu.: 83.00        3rd Qu.: 70.00   3rd Qu.:  8.000  
##  Max.   :9.700   Max.   :100.00        Max.   :100.00   Max.   :300.000  
##  NA's   :2099    NA's   :9098          NA's   :11144    NA's   :9405     
##  Awards.Nominated.For  Boxoffice         Release.Date      
##  Min.   :  1.00       Length:15480       Length:15480      
##  1st Qu.:  2.00       Class :character   Class :character  
##  Median :  5.00       Mode  :character   Mode  :character  
##  Mean   : 13.98                                            
##  3rd Qu.: 12.00                                            
##  Max.   :386.00                                            
##  NA's   :7819                                              
##  Netflix.Release.Date Production.House   Netflix.Link        IMDb.Link        
##  Length:15480         Length:15480       Length:15480       Length:15480      
##  Class :character     Class :character   Class :character   Class :character  
##  Mode  :character     Mode  :character   Mode  :character   Mode  :character  
##                                                                               
##                                                                               
##                                                                               
##                                                                               
##    Summary            IMDb.Votes           Image              Poster         
##  Length:15480       Min.   :      5.0   Length:15480       Length:15480      
##  Class :character   1st Qu.:    403.5   Class :character   Class :character  
##  Mode  :character   Median :   2322.0   Mode  :character   Mode  :character  
##                     Mean   :  42728.4                                        
##                     3rd Qu.:  20890.5                                        
##                     Max.   :2354197.0                                        
##                     NA's   :2101                                             
##  TMDb.Trailer       Trailer.Site      
##  Length:15480       Length:15480      
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
##                                       
## 

Based on the summary and the boxplot, we use 7.3 as the standard for a high IMDb score, because it is the third quartile (75th percentile) of the IMDb scores.
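As a rough illustration of this threshold choice, the Python sketch below computes the third quartile of a score column and flags the titles at or above it. The sample scores, the function name, and the `>=` comparison are assumptions made for the example, not the report's data or its exact rule.

```python
from statistics import quantiles

def high_score_threshold(scores):
    """Return the third quartile of the non-missing scores."""
    known = [s for s in scores if s is not None]
    # "inclusive" interpolates between the sorted data points,
    # treating the sample minimum and maximum as the 0th and 100th percentiles.
    _q1, _q2, q3 = quantiles(known, n=4, method="inclusive")
    return q3

# Hypothetical scores; None stands in for a missing value (NA).
sample_scores = [5.8, 6.6, 7.3, 8.1, 4.2, 6.9, 7.7, None]
threshold = high_score_threshold(sample_scores)

# Assumption: "high" means at or above the threshold.
high_rated = [s for s in sample_scores if s is not None and s >= threshold]
print(threshold, high_rated)
```

On the full dataset the same idea would be applied to the IMDb score column after dropping its missing values.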

All shows

After the necessary preprocessing steps, such as cleaning and stemming, we create a word cloud from the summaries of all shows on Netflix. The most common word is “life”.
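The word frequencies behind a word cloud can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the stop-word list is a tiny made-up one, the sample summaries are invented, and stemming is left out for brevity, so it does not reproduce the report's exact pipeline.

```python
import re
from collections import Counter

# A tiny stop-word list for illustration only; real lists are much longer.
STOP_WORDS = {"a", "an", "the", "and", "of", "to", "in", "his", "her",
              "with", "for", "on", "is"}

def word_frequencies(summaries):
    """Count cleaned, non-stop-word tokens across all summaries."""
    counts = Counter()
    for text in summaries:
        # Lowercase, then keep alphabetic runs (drops digits and punctuation).
        tokens = re.findall(r"[a-z]+", text.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts

# Hypothetical summaries, not taken from the dataset.
sample_summaries = [
    "A young man rebuilds his life after a loss.",
    "Two friends navigate love and life in the city.",
]
freqs = word_frequencies(sample_summaries)
print(freqs.most_common(3))
```

The resulting counts are what a word cloud maps to font size: the more frequent the word, the larger it is drawn.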

High rating shows

When looking only at shows with a high rating, the word cloud is similar to the previous one for all shows. “life” is still the most frequent word, and several other words overlap, such as “young”, “friend”, “love”, and “man”. The only difference between the two top-10 lists is that “woman” is in the top 10 for all shows, while “world” is in the top 10 for high rating shows. This might suggest that shows that do not mention “woman” in the summary are more likely to receive a higher score. According to IMDb’s official statistics on the breakdown of ratings by gender, almost every show receives about 5 times more ratings from male users, so the platform is strongly male oriented. One possible reading of this word cloud is that, because the platform has more male users, these users tend to give higher scores to shows whose summaries do not contain the word “woman” and instead contain non-feminine words like “world”.
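The comparison of the two top-word lists amounts to a set difference. The Python sketch below shows the idea with hypothetical counts (the words come from the text above, but the numbers are invented and the list is cut to the top 4 to keep the example small).

```python
from collections import Counter

def top_words(counts, n=10):
    """Return the set of the n most frequent words."""
    return {word for word, _ in counts.most_common(n)}

# Hypothetical counts, not the report's actual frequencies.
all_shows = Counter({"life": 50, "young": 30, "love": 25, "woman": 20, "world": 10})
high_rated = Counter({"life": 20, "young": 12, "love": 9, "world": 8, "woman": 3})

top_all = top_words(all_shows, n=4)
top_high = top_words(high_rated, n=4)

shared = top_all & top_high      # words in both top lists
only_all = top_all - top_high    # top words only among all shows
only_high = top_high - top_all   # top words only among high rated shows
print(sorted(shared), only_all, only_high)
```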

Series and Movies comparison

Next, we create a comparison cloud and a commonality cloud showing the most frequent words in series and movie summaries.

Words like “life”, “young”, “family”, “love”, and “new” are frequent in both series and movie summaries.

Movie summaries contain more relationship nouns, such as “girlfriend”, “mother”, “son”, “wife”, and “father”, while series summaries tend to contain verbs, such as “follow”, “navigate”, “show”, “explore”, and “host”.

Summary Words and Ratings

We then create a pyramid plot to show how word frequencies differ between movies’ descriptions with high ratings and those with low ratings.

## 1137 1137

Because there are more low rating shows than high rating shows, high rating shows naturally have lower word frequencies. So when a specific word appears more often in high rating shows than in low rating shows, the difference in percentage terms is considerable. Thus, words like “footage”, “python”, and “chronicle” are more likely to appear in high rating shows.
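Because the two groups differ in size, raw counts are easier to compare once they are converted to rates. The Python sketch below normalizes counts to occurrences per 1,000 words; the counts and the `rate_per_thousand` helper are hypothetical and only illustrate the normalization, not the report's plot.

```python
from collections import Counter

def rate_per_thousand(counts):
    """Convert raw word counts to occurrences per 1,000 words in the group."""
    total = sum(counts.values())
    return {word: 1000 * count / total for word, count in counts.items()}

# Hypothetical counts: the low rating group has three times as many words.
low = Counter({"life": 300, "footage": 5, "other": 2695})   # 3,000 words
high = Counter({"life": 100, "footage": 10, "other": 890})  # 1,000 words

low_rate = rate_per_thousand(low)
high_rate = rate_per_thousand(high)

# "life" has a higher raw count in the low group but the same rate in both;
# "footage" is clearly more frequent in the high group on either measure.
print(low_rate["life"], high_rate["life"])
print(low_rate["footage"], high_rate["footage"])
```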

Summary Readability & Ratings

##   document Flesch Flesch.Kincaid meanSentenceLength meanWordSyllables
## 1    text1  51.25          11.53               21.8             1.577

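The higher the FRE (Flesch Reading Ease) score, the easier the text is to understand; the lower the FRE score, the harder it is to understand. Texts in the 0-30 range are very difficult to read and are best understood by university graduates. The Flesch Kincaid score runs in the opposite direction: it approximates a U.S. school grade level, so a higher score indicates harder text.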

Treating all summary text as a single document, we obtain a Flesch Kincaid score of 11.53, a mean sentence length of 21.8 words, and a mean of 1.577 syllables per word.
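Both readability measures are simple functions of the mean sentence length and the mean syllables per word. The Python sketch below applies the standard Flesch Reading Ease and Flesch Kincaid grade level formulas to the rounded values reported above; the function names are chosen for this example.

```python
def flesch_reading_ease(mean_sentence_length, mean_word_syllables):
    """Flesch Reading Ease: higher means easier to read."""
    return 206.835 - 1.015 * mean_sentence_length - 84.6 * mean_word_syllables

def flesch_kincaid_grade(mean_sentence_length, mean_word_syllables):
    """Flesch Kincaid grade level: higher means harder to read."""
    return 0.39 * mean_sentence_length + 11.8 * mean_word_syllables - 15.59

# Rounded inputs taken from the readability output above.
fre = flesch_reading_ease(21.8, 1.577)
fk = flesch_kincaid_grade(21.8, 1.577)
print(round(fre, 2), round(fk, 2))
```

The results land close to the reported 51.25 and 11.53; the small gap is expected because the inputs used here are already rounded.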

FRE vs. IMDb Score

From the graph above, there is no obvious correlation between Flesch Kincaid score and IMDb score, as most Flesch Kincaid scores are concentrated in the middle. However, when the IMDb score is over 6, a few Flesch Kincaid scores stand out from the crowd as higher than average. This pattern diminishes when the IMDb score is above 9.
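A visual impression of "no obvious correlation" can be checked numerically with a Pearson correlation coefficient. The Python sketch below is one way to compute it; the paired scores are invented so that the result is near zero, and are not values from the dataset.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical paired values: readability score and IMDb score per show.
fk_scores = [8.0, 10.0, 12.0, 14.0, 16.0]
imdb_scores = [6.5, 6.0, 7.0, 6.0, 6.5]

r = pearson_r(fk_scores, imdb_scores)
print(round(r, 3))
```

A value near 0 indicates no linear relationship, while values near 1 or -1 indicate a strong one.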

A useful insight from this graph is that for low rating shows with an IMDb score below 3, the Flesch Kincaid score is always below 15. In other words, the shows whose summaries have a Flesch Kincaid score of 15 or more all have an IMDb score of at least 3, although a higher Flesch Kincaid score doesn’t guarantee how high the IMDb score can reach.

To draw more insights from this relationship, we take a closer look by separating movies and series. Movies and series with FRE scores of 10-15 are more likely to receive a higher IMDb score.
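One way to check a claim like this is to split titles by type and by whether their readability score falls inside the 10-15 band, then compare mean IMDb scores. The Python sketch below does that on invented records; the tuple layout and the `mean_score_by_band` helper are assumptions for the example.

```python
from collections import defaultdict

def mean_score_by_band(records, low=10, high=15):
    """Average IMDb score per (kind, inside-band?) group."""
    groups = defaultdict(list)
    for kind, readability, imdb in records:
        in_band = low <= readability <= high
        groups[(kind, in_band)].append(imdb)
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}

# Hypothetical records: (Series or Movie, readability score, IMDb score).
records = [
    ("Movie", 12.0, 7.0),
    ("Movie", 14.0, 8.0),
    ("Movie", 6.0, 6.0),
    ("Series", 11.0, 8.0),
    ("Series", 18.0, 7.0),
]
means = mean_score_by_band(records)
for key in sorted(means):
    print(key, means[key])
```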

FRE vs. IMDb Votes

Summaries with an FRE score in the 10-15 range tend to have the potential to receive a high number of IMDb votes. It is also fair to say that high quality movies or series tend to write their summaries in the 10-15 range, which corresponds to a college graduate reading level.

FRE vs. High Ratings

For these high rating shows, the Flesch Kincaid score varies widely. Interestingly, the Flesch Kincaid score starts to decrease when the IMDb score is above 8.5. The lower limit for shows with a rating above 9 also starts to increase, narrowing the range of Flesch Kincaid scores. One interpretation is that summaries with a Flesch Kincaid score between 7 and 17 are more likely to receive an IMDb score above 9.

FRE vs. Release Date

In more recent years, the range of FRE scores shifts: from largely around 10-15 in 1980 to a range of 5-10 in 2020.

Conclusions

To conclude the second part of the project, a few characteristics associated with high rating shows on Netflix can be summarized as follows: the summary is less feminine and does not contain the word “woman”; it includes words such as “documentary”, “footage”, “chronicle”, and “python”; it has a Flesch Kincaid score above 15 to receive an above average IMDb score and below 17 to receive a high IMDb score; and movies with a high Flesch Kincaid score are more likely to receive a high IMDb score. This information could be used by future producers to aim for higher ratings on Netflix.

Citation

code reference: http://www.sthda.com/english/wiki/text-mining-and-word-cloud-fundamentals-in-r-5-simple-steps-you-should-know